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Abstract 



In parametric sequence alignment, optimal alignments of two sequences are com- 
_ puted as a function of the penalties for mismatches and spaces, producing many 

^ ' different optimal alignments. Here we give a 3/(2^'^7r-^''^)n-^'^ + 0{n^'^ logn) lower 

\^ . bound on the maximum number of distinct optimal alignment summaries of length 

Q I n binary sequences. This shows that the upper bound given by Gusfield et. al. is 

"q ■ tight over all alphabets, thereby disproving the "-^/n conjecture". Thus the maxi- 

mum number of distinct optimal alignment summaries (i.e. vertices of the alignment 
polytope) over all pairs of length n sequences is 0(n^/^). 
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K^ , Finding optimal alignments of DNA or amino acid sequences is often used in 

biology to measure sequence similarity (homology) and determine evolutionary 
history. For a review of many problems relating to sequence alignment, see 
[5] and [8]. Here we deal with the question of how many different alignment 
summaries can be considered optimal for a given pair of sequences (though 
many different alignments may correspond to the same alignment summary). 

Given sequences S, T, an alignment F is a pair (S", T') formed by inserting 
spaces, "— ", into S and T. In each position, there is a match, in which S" 
and T' have the same characters, a mismatch, in which they have different 
characters, or a space in one of the sequences. Then for any alignment, we 
have an alignment summary {w,x,y), where w is the number of matches, x 
is the number of mismatches, and y is the number of spaces in one of the 
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sequences. Notice that n = w + x + y^ where n is the length of both sequences. 
Given a pair of sequences, the convex hull of all such points (w, x, y) is called 
their alignment polytope. 

We can score alignments by weighting each component. Since we have w + x + 
y = n, we can normalize so that the weight of w is 1, the weight of x is —a 
and the weight of y is —jS. Then 

score(^a,/3) {u!, x,y) = w ~ ax — (3y. 

A sequence is optimal if it maximizes this score. For biological relevance, 
we will only consider non-negative a and /3, which penalizes mismatches and 
spaces. It is also possible to weight other parameters, such as gaps (consecutive 
spaces) or mismatches between certain subsets of characters. Here we will 
consider only the two parameter model described above. 

Example 1 For the sequences 111000 and 010110, we have an alignment 

- 1 - 1 1 
10 111 

which has 3 matches, 1 mismatch, and 2 spaces. So for a given a and [3 the 
score of the this alignment would be 3 — a — 2p. 

Any value of a and (3 will give an optimal alignment. Given a and (3, we can use 
the Needleman-Wunsch algorithm to effectively compute optimal alignments 
[6J (for a review, see [H Ch. 2, 7]). Unfortunately, different choices for a, (3 give 
different optimal alignments, leaving the problem of which weights to use. To 
resolve this. Waterman, Eggert, and Lander proposed parametric alignment, 
in which the weights a, (3 are viewed as parameters rather than constants [^. 
Since alignments are discrete, this creates a partition of the {a, (3) plane into 
optimality regions, so that for each region R, there is an alignment that is op- 
timal for all the points on its interior and R is maximal with this property [1]. 
Each optimality region is a convex cone in the plane |1], [8[ Ch. 8]. Notice that 
because our scoring function is linear, the vertices of the alignment polytope 
are our optimal alignment summaries. Also, if we let P^y be the convex hull 
of all {x, y) occurring in alignment summaries, then 

score(^a,i3) = w — ax — Py = n — {a + l)x — (/? + l)y, 

since n = w + x + y. Thus the vertices of P^y will be those that minimize {x, y) ■ 
{a + 1, (3 + 1) for some (a, /3), thus maximizing score(^a,i3) and corresponding to 
optimal alignments [8] . From this we can see that the the decomposition of the 
{a, (3) plane into optimality regions can be obtained by shifting the normal 
fan of Pxy by (—1,-1) [H Ch. 8]. The goal of parametric alignment is to 
find all these optimality regions with their corresponding optimal alignments. 



The Needleinan-Wunsch algorithm is also an effective method of computing 
the alignment polytope of sequences (and thus optimal alignments and the 
decomposition of the {a, (3) plane) [S]. 

Gusfield et. al. showed that for two sequences of length n, the number of op- 
timality regions of the {a, (3) plane (equivalently the number of vertices in 
their alignment polytope) is 0(n^/^)m. Indeed for larger dimensional mod- 
els (say with d free parameters), this bound was extended to 0(n'^~''^/'^^) 
by Fernandez-Baca et. al. [3] and improved to 0(n'^*^'^~^''/*^'^+^^) by Pachter 
and Sturmfels [7]. For d = 2, Fernandez-Baca et. al. refined this bound to 
3(n/27r)^/^ + 0(n^/^log(?2)) and showed it to be tight over an infinite alphabet 
[2]. They also provide a lower bound of ^(-y/n) over a binary alphabet. Using 
randomly-generated sequences, Fernandez-Baca et. al. observed that the aver- 
age number of optimality regions closely approximates ^/n. This led them to 
conjecture that, over a finite alphabet, the expected number of optimality re- 
gions is 6(A/n)[2]. The question remained of whether or not the upper bound 
of Gusfield et. al. was tight over a finite alphabet. For a discussion, see [Hi Ch. 
8], which conjectures that the maximum number of optimality regions induced 
by any pair of length-n binary strings is Q{y/n) [8]. Here we construct a coun- 
terexample to this conjecture, which together with the above upper bounds 
shows it instead to be B(n^/'^). Our main theorem is that Gusfield's bound is 
tight for binary strings. 

Theorem 2 (Main Theorem) The maximum number of optimality regions 
induced by binary strings of length n is Q{n^/^). 

Ideally, sequences would have few optimal alignments, making the "best" one 
more apparent. While this result may not tell us about the expected number 
of optimal alignments (or be biologically relevant), it does provide a worst 
case scenario for sequence alignment and show that the bound from [1] cannot 
be improved. Luckily, the bound is still sublinear. Indeed parametric sequence 
alignment can be practical and has been achieved for whole genomes [Ij. This 
paper is mainly motivated by [2], [1], and [8]. We largely follow their notation 
and presentation. 



2 Decomposing the (a,/5) plane 
2.1 Alignment Graphs 



We can represent every alignment of two length-n sequences as a path through 
their alignment graph. The graph can be thought of as an {n + 1) x (n + 1) grid, 
with rows and columns numbered consecutively from top to bottom (left to 
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Fig. 1. (Left):Above is the path corresponding to the aUgnment of 111000 and 
010111 given in example [H (-1-11000, 010111- -). The shaded regions denote pos- 
sible matches. 

(Right): Here are the optimality regions of the (a,/3) plane induced by 
the sequences 111000 and 010111. The alignments optimized in each re- 
gion are Ti = (111000,010111), Ta =(-1-11000, 010111- -), and 
Tg =(-1-11-000, 010111- - -). 

right), from to n [2j. An alignment path is a path on these vertices, starting 
at (0,0), ending at (n, n), and only moving down, right or diagonally down 
and to the right. Each path corresponds to a unique alignment. In this path, 
a move down (or left) corresponds to a space in the first (or second) sequence, 
and a diagonal move corresponds to a match or mismatch (depending on the 
characters). See Figure 12.11 for the alignment graph of our above example 
alignment. 



2.2 Optimality regions 



Gusfield et. al. observed that the boundaries between optimality regions in 
the (a,/3) plane must be lines passing through the point (—1, —1). 

Lemma 3 (Gusfield et. al., [4J) All optimality regions on the {a, (3) plane 
are semi-infinite cones, and are delimited by lines of the form (3 = c+ {c+ l)a 
for some constant c. 

In general, a boundary between two optimality regions consists of the (a,/?) 
for which the optimal sequences from each region have equal, optimal scores. 
Since 

score(-i-i){w, x,y) = w + x + y = n, 

for every w,x,y, each such line (specifically these boundary hues) must pass 
through the point (—1, —1). They also note that all of these boundary lines 
must intersect the non-negative /3-axis because none of them cross the positive 



a-axis [1]. This comes from observing that in any ahgnment, we can change 
a mismatch to a space (in each sequence) without affecting the number of 
matches. Thus all along the line /3 = 0, the optimal alignment will have the 
maximum number of matches possible, without regard to spaces (since those 
are not penalized). So no boundary line can separate the nonnegative a-axis 
into distinct optimality regions. Since all boundary lines must pass through 
the point (—1,-1) and cannot intersect the positive a-axis, we indeed have 
that 

Lemma 4 (Gusfield et. al., [3]) Each of the optimality regions must have 
nontrivial intersection with the non-negative [3-axis. That is, for any path V 
that is optimized by some (a,/5), there must he some (3' so that V is optimized 
byi0,/3'). 

This allows us to restrict our attention to optimality regions on the /?-axis. 
Then boundary regions are just points, (0,/5), for which consecutive opti- 
mal alignments have optimal score(o,/3). Note that alignments with summaries 
{wi,Xi,yi) and {w2,X2,y2) will have equal score(o,/3) when 

u!i - Pyi = W2- py2, 
meaning that 

_ AW7 W2 - Wi 

Ay ■ y2-yi' 

In order to find different optimality regions, we will find distinct -^ forming 
boundary points on the /?-axis. 



3 The Lower Bound 

For each 2 < r, define F^ as 

-Fr := {f < 1 : f is reduced and a + b = r}. 

Since a/b is reduced and a + b = r, a and b must be relatively prime to r. 
Then each number relatively prime to r will show up exactly once (in either 
the numerator or the denominator), so \Fr\ = 0(r)/2 for r > 2 where is the 
Euler totient function, and IF2I = |{1/1}| = 1. 

Let 



giving us I J'gl = i i;r=3 <Pir) + 1. 



r=2 



Fixing q, let ai/bi < 02/62 < . . . < am/bm = 1 be the elements of Tq. We're 
going to construct two sequences of length n = 4X]fc ^k, S = S1S2 ■ ■ ■ Sn and 
T = tit2 ■ ■ .tn- Since bk < ak + bk, this gives us 

mm s s 

k=l k=l r=2 r=2 



3.1 The Sequences 



Let's construct the first sequence, S. To start, let the first 61 + Oi elements 
of S be 0, followed by bi — ai I's. Then repeat for k > 1 (i.e. next place 
62 + ^2 O's followed by 62 — ^2 I's)- Notice that for each ak/b^ G J-'q, we use 
{bk + dk) + {bk — ttk) = 2bk places. To get the second half of the sequence, take 
the reverse complement of the first half (refiecting it and switching all the I's 
and O's). So 



More formally, define 

r m 

k=l k=r 

(So n = 2i{m) = 2j(l)). Then 

{0 for 1 < k < br + ttr 
1 for 6,. + a^ + 1 < /c < 26^ 

and 

{0 for 1 < k < br — ttr 
1 for br — dr + i ^ k < 2br. 

The second sequence, T, will just be n/2 I's followed by n/2 O's, that is, 

_ 1 for 1 < A; < n/2 
tk 

for n/2 + 1 < A; < n. 

Example 5 For q = 4, J^4 = {1/3, 1/2, 1/1}. Then n = 4(3 + 2 + 1) = 24. 
Our sequences are 

S = 000011000100 110111001111 



111111111111000000000000 




Fig. 2. The alignment graph for r4,r3,r2,ri (top to bottom) from the exam- 
ple above. The shaded regions denote possible matches. Note that for q = 4, 
|^4| = 3. 



m 



T = 111111111111 000000000000 



3.2 The Alignment Paths 



We are going to construct m + 1 alignment paths, F 



m+l; -•- m,5 



Fi. Let Fm+i 
be the path along the main diagonal (corresponding to the alignment with no 



spaces). To get F^, align the first j{ 



J2T=r 2&fc O's of S with spaces and 



align its remaining elements without spaces, ending by aligning the last j{r) 
O's of T with spaces. 



Note that because there are n/2 I's in both 5* and T, we'll have enough room 
to do this. In fact, in the last alignment, Fi, all the I's of S will be matched 
with all the I's of T. See Figure [3^2] for the graphs of the optimal alignments 
of our example. 




Fig. 3. Here is the decomposition of the (a, /3) plane given by the sequences in 
Example [5l Each optimality region is labeled with the alignment path Tr that it 
optimizes. 

3.3 Alignment Scores 



Let wl denote the number of matching I's in F^ and similarly w^ denote the 
number of matching O's in F,,, with Wr being the total number of matches. 
Note that 

wl — wl_^_l = br + a-r and w^ — w^^j^^ = —{h,. — a^). 

Since Wr = wl + w^, we have that 

Wr — Wr+l = ipr + Oj.) — (6,. — a^) = 20^. 

Let i/r denote the number of spaces in F^ (which equals j{r)). Then 

yr - Vr+i = Jir) - j(r + 1) = 26,.. 



Putting these together, we get that for every r, 

AWr Wr — Wr+1 ttr 

Ayr ' Vr - Vr+l K 



:i) 



3.4 Optimality 



We need to show that each of these paths is optimal for distinct optimality 
regions, which will be accomplished by the next two lemmata. 



Lemma 6 Let T be any alignment of S and T. Then for any (3 > 0, there is 
some Tr so that score(o,/3)(rr) > score(o,/3)(r). 



PROOF. Say that F has ahgninent path a and ahgnment summary {w, x, y). 
Let the coordinates of the ahgnment graph be (t, s), with (0, 0) starting in the 
upper left corner. Say that {n/2, n/2 + k) is the first time a meets the vertical 
line t = n/2. 

Because of the symmetry of our sequences, we can take k to be nonnegative 
(meaning that a hits the line t = n/2 below or at s = n/2). If a has k < 0, 
we can rotate our picture 180° to get another alignment path with the same 
summary and k > 0. 

So suppose k >0 and take r so that j(r + 1) < A; < j{r). 

(Case 1: k — j{r + 1) < h.,. — ar). 

Since there are only wIj^^ I's above s = n/2 + k, we have w^ < w^+i- Similarly, 
there are at most w^+i O's below s = n/2 + k, so w^ < w^_^i. Furthermore, 
by going through the point {n/2, n/2 + k), a must have at least k spaces, so 
y>k>j{r + l)= yr+i- Putting these together gives that for any (3 > 0, 

score(o,/3)(Fr+i) - score(o,/3)(F) = {wr+i - w) - Piyr+i -y) >0. 

Intuitively, F can have at most as many matches and must have at least as 
many spaces as Fj.+i, and thus cannot have a higher score. 

(Case 2: k- j{r + 1) > br - ar and (3 < I) 

There are w^ O's in S below s = n/2 + k, so we have w^ < w^.. In addition to 
the wIj^i I's in S above s = n/2+j{r + l), there are another k—j{r) + {br + ar) 
I's in S between s = n/2 + j{r + 1) and s = n/2 + k. So 

w^ < wl^i + k - j{r) + {br + ar) = wl + k - j{r), 

since wl^^ + {h,. + ar) = wl. Thus 

w = w'^ + w^ < w^. + wl + k — j{r) = Wr + k — j{r). (2) 

As is case 1, we have that y > k, so 

scor e(o,/3)(F^,) - score(o,/3)(F) = {wr - w) - l3{yr - y) 

>(j(r)-k)-(3ijir)~k) (by©) 

> (as /5 < 1) 



(Case 3: k- j{r + 1) > hr - a-r and 13 > I) 

We'll show that score(Q^p){Tm+i) > ■score(o,/3)(r). Remember that F^+i is the 
alignment with no spaces {ijm+i = 0), corresponding to the main diagonal of 
the alignment graph. Note for any r, 

•m 

Wr = Wm+1 + J2 2«fc' (3) 

k=r 

SO using equation ([2]) from case 2, we get 

m 

Wm+1 -w> j{r) - k-Yl 2afc- 

k=r 

As in previous cases, y > k. Then, 

score(o,/3)(r„+i) - score(o,/3)(r) = (w^+i - w) - P{ym+i - y) 

m 

>3{r)-k-Y.2ak + (3k (by©) 

(as (3>l) 





k=r 


>3{r)- 






k=r 


m 


m 


= >:26< 


:-)_2ak 


k=r 


k=r 


>0. 





Lemma [6] tells us that any optimality region has one of the Tr as an optimal 
alignment. Now we need to check that each score(o,/3)(rr) is optimized by a 
different region. To see this, we use equation ([1]) and following lemma. 

Lemma 7 (Fernandez-Baca, et. al., [2]) Let Fi, r2, • • • , F^ be paths in the 
alignment graph. Assume score(Fj) = Wi — Pyi, where yi > y2 > ■ ■ ■ > yq- Let 
Po = 0,pq = cx), and for r = 1, . . . ,g-l, Pr = {wr-Wr+i)/{yr-yr+i)- Suppose 
Po < Pi < ■ ■ ■ < Pq- Then for P G {Pr-i, Pr) and p ^ r, score(o,/3)(Fj.) > 
score(o,/3)(Fp). 

So each of the F^ do indeed represent each of the different optimality regions 
on the /5-axis, and thus in the (a, /3) plane. 



3.5 The Actual Lower Bound 



Theorem 8 The maximum number of optimality regions induced by any pair 
of length-n sequences is fi(n^/'^). 



10 



PROOF. Above we have constructed sequences of length n < 2X]r=2'^0('^) 
that gave m = ^ J2t=2 'Pij) optimahty regions. From analytic number theory, 
as calculated in [2], 

I 1 3 

m = - ^ 0(r) + 1 = —q^ + O(glogg), 

and 

9 4 



n 



< 2 Vr0(r) = —q^ + 0{qHogq). 
Then g > (^)^/^ + O(logn), meaning 



3 n^l^^Oi'n'lHogn). 



27/37^2/3 



With the upper bounds from |1] and ^, this gives 

Corollary 9 The maximum number of optimality regions over all pairs of 
length-n sequences is Q{n'^^^), and more specifically is between ^/^ 2/3 ^^^^ + 
0{n^/^ log n) and (2 J2/3 ^^^^ + 0{n^/^ log n) . 

It's unclear whether the current bounds on optimality regions for scoring with 
d > 2 parameters, 0{n^^'^~^'''^'^'^^'), are also tight or whether better upper 
bounds exist. Another interesting open question (perhaps with more practical 
relevance) is the order of the expected number of optimality regions, rather 
than the maximum. 
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